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Abstract 

This work demonstrates how partial evaluation can be put to practical use 
in the domain of high-performance numerical computation. I have devel- 
oped a technique for performing partial evaluation by using placeholders to 
propagate intermediate results, and have implemented a prototype compiler 
based on this technique. For an important class of numerical programs, this 
compiler improves performance by an order of magnitude over conventional 
compilation techniques. I also show that by eliminating inherently sequential 
data-structure references, partial evaluation exposes the low-level parallelism 
inherent in a computation. I have implemented a parallel program genera- 
tor, as well as several analysis programs that study the tradeoffs involved in 
the design of an architecture that can effectively utilize this parallelism. I 
present these results using the 9-body gravitational attraction problem as an 
example. 
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fulfillment of the requirements for the degree of Master of Science. 
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Chapter 1 



Introduction 



Scientists are faced with a dilemma: Either they can write programs that 
express their understanding of a problem, but which do not execute effi- 
ciently; or they can write programs that computers can execute efficiently, 
but which are difficult to write and difficult to understand. This report ex- 
plores how partial and symbolic evaluation can be combined with traditional 
compilation techniques to provide a solution to this dilemma. 

My work demonstrates how partial evaluation can be used to reduce the 
performance penalties associated with abstraction in numerical programs. I 
have developed a technique for performing partial evaluation by using place- 
holders to propagate intermediate results, and have implemented a prototype 
compiler based on this technique. For an important class of numerical pro- 
grams, this compiler improves performance by an order of magnitude over 
conventional compilation techniques. Experiments with this compiler show 
that by eliminating inherently sequential data-structure references, partial 
evaluation exposes the low-level parallelism inherent in a computation. I 
have implemented a parallel program generator, as well as several analysis 
programs that study the tradeoffs involved in the design of a parallel archi- 
tecture that can effectively utilize this parallelism. 



1.1 Overview 

Numerical programs written in high-level languages spend much of their 
time manipulating data structures rather than performing numerical com- 
putations. By using information that is available at compile time about 
the particular problem that a program will solve, it is possible to perform 
some of these data structure manipulations in advance, creating a special- 
ized, partially-evaluated program. In many important scientific applications, 
there is enough information available for the compiler to derive the under- 
lying numerical computation from the high-level program, entirely eliminat- 
ing compound data structures and abstractions. In Chapter 2, I present 
a technique for performing the partial evaluation required to create such a 
specialized program by executing the program symbolically at compile time. 

I have implemented a prototype compiler based on this technique. This 
compiler combines the specialization provided by partial evaluation with tra- 
ditional compilation techniques, allowing optimizations such as constant fold- 
ing, symbolic simplification, and dead-code elimination to be performed on 
the low-level computation itself, without interference from abstraction mech- 
anisms or compound data structures. In Chapter 3, 1 describe the application 
of this compiler to several real-world scientific applications, and present per- 
formance measurements. 

Experiments with the prototype compiler have shown that eliminating 
abstractions not only improves performance, but also exposes the low-level 
parallelism inherent in a computation. Eliminating inherently sequential 
data structure references imposed by the way the programmer thinks about 
a problem allows the compiler to re-order the computation so as to allow 
intermediate results to be used immediately in other parts of the computa- 
tion. I have implemented a parallel program scheduler, and used it to study 
the nature of the parallelism available in the 9-body gravitational attraction 
problem. Chapter 4 presents these results, as well as a strategy for effec- 
tively utilizing this parallelism by generating a specialized routing network 
for a particular application. 



1.2 Background: Previous Work 



1.2.1 Abstraction in Scientific Computation 

Recent work[Halfant] has shown that through the use of the abstraction 
mechanisms of high-level languages, it is possible to create more powerful 
problem- solving strategies than would otherwise be feasible. Unfortunately, 
the performance penalties associated with abstraction have precluded the use 
of high-level languages such as Lisp for computationally intensive numerical 
applications. The challenge facing high-level language compilers is to reduce 
these performance penalties, thereby fundamentally changing the way that 
scientists write programs. 



1.2.2 Specialization 

High-performance numerical programs often generate specialized routines 
in order to improve performance. For example, the SPICE analog circuit 
simulator[Nagel] generates specialized matrix manipulation routines to take 
advantage of the sparse matrices that arise out of network node equations. 
Similarly, "straight-line" implementations of the Fast Fourier Transform (FFT) 
may be generated, such that all array indices are computed in advance. These 
specialized routines are typically created using a hand-crafted generator pro- 
gram that operates over a relatively small segment of a large problem. I 
present a generalization of this technique that allows generator programs to 
be expressed abstractly, enabling specialization to occur over a larger portion 
of the overall problem than would otherwise be practical. 



1.2.3 Partial Evaluation 

The process through which specialization is achieved is known as partial eval- 
uation. Partial evaluation has been widely written on, but has not yet found 
much practical use. In particular, [Schooler] did some interesting work on au- 
tomatically generating a compiler by partially evaluating the interpreter, and 
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Chapter 2 



Partial Evaluation 



I present a technique for performing partial evaluation by executing a pro- 
gram symbolically at compile time. The key idea is that programs can be 
divided into "data-independent" regions in which the flow of control can be 
predicted at compile time. The code for these regions may be evaluated in 
advance, eliminating data abstractions and compound data structures. 



2.1 Symbolic Execution 

The goal of my compiler is to figure out in advance what numerical oper- 
ations a program will need to perform to produce its result. Interpreted 
programs determine this at run time by manipulating abstract data struc- 
tures. By tracing the execution of an interpreted program, one can observe 
the sequence of numerical operations that it performs. In the special case 
of a data-independent computation, this sequence of numerical operations is 
independent of the numerical values of the input data. Thus, one way for a 
compiler to predict in advance what numerical operations a data-independent 
program will perform is to simply run the program at compile time. 

In order to execute a program at compile time, it is necessary to create 
the data structures that the program will manipulate. The values for some 
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of the program's numerical inputs will be available at compile time, and can 
be included directly in their normal places in the data structures. How- 
ever, numerical values for some pieces of data will not be available until run 
time. These missing values are represented symbolically using placeholders. 
A placeholder is a data structure that is used to represent a specific piece of 
missing data. Each placeholder contains whatever information is available 
about the piece of data that it represents, including its type and its point of 
origin. 

During the compile-time execution, a numerical operation that receives 
numerical arguments executes normally, producing a numerical result. Such 
an operation does need not be included in the compiled program, since its 
results have already been computed. However, a numerical operation that 
receives a placeholder as an argument must be delayed until run time, when 
the value represented by the placeholder is available. An instruction that 
will perform the delayed operation at run time is added to the compiled 
program, and a new placeholder is created to represent the result of the 
operation. This new placeholder propagates through the compile time data 
structures, eventually becoming an input to another numerical operation, 
which will in turn be delayed until run time. 

Compile-time interpretation produces result data structures which them- 
selves contain placeholders. These result placeholders represent the final 
results of the computations that have been delayed until run time. The 
compiled program takes as arguments numerical values for each of the place- 
holders in the input data structures, and executes the delayed instructions 
to produce numerical values for the result placeholders. 

The power of this approach lies in the fact that placeholders are manipu- 
lated as if they are the actual data values that they represent. Placeholders 
propagate through data abstractions, allowing abstractions to be resolved 
at compile time. For example, (cons x y) will simply create a cons cell, 
even if x happens to be a placeholder. Later in the program, when this cons 
cell is referenced using CAR, the placeholder will be returned. Thus, data 
abstractions utilized in the specification of a program have no effect on the 
run time performance of the instruction stream generated by the compiler. 
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2.1.1 An Example 

To illustrate the compilation process for data-independent programs, consider 
the compile time execution of the sum-of-squares program shown below, for 
an application in which the input is known to be a list of three floating-point 
numbers, the last of which is always 3.14. The compiled program is shown 
in Figure 2.1. 

==> (define (square x) (* x x)) 
--> (define (sum-of-squares L) 

(apply + (map square L))) 

==> (sum-of-squares (list (make-placeholder 'floating-point) 

(make-placeholder 'floating-point) 
3.14)) 



1 COMPILED PROGRAM (SPECIALIZED SUM-OF-SQUARES): 

2 ;;The numbers refer to the placeholder numbers 

3 (INPUT 1) 

4 (INPUT 2) 
5 

6 (ASSIGN 3 (Floating-Point-* (FETCH 1) (FETCH 1))) 

7 (ASSIGN 4 (Floating-Point-* (FETCH 2) (FETCH 2))) 

8 (ASSIGN 5 (Floating-Point-* (FETCH 3) (FETCH 4) 9.8596)) 

9 (RESULT 5) 

Figure 2.1: Specialized Code for Sum-of-squares. Notice how the squaring of 
3.14 to produce 9.8596 took place at compile time. 
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At compile-time, the program begins executing normally, as shown below: 

Entering: (sum- of -squares (<placeholder-l> <placeholder-2> 3.14)) 
Entering: (apply + (map square 

( placeholder- 1> <placeholder-2> 3.14))) 
Entering: (map square (<placeholder-l> <placeholder-2> 3.14)) 
Entering: (square <placeholder-l>) 

Entering: (* <placeholder-l> <placeholder-l>) 

At this point, the multiply operation can not proceed normally, because its 
arguments are placeholders. A new placeholder, <placeholder-3> is created 
to represent the result of the multiplication. Since both arguments to mul- 
tiply were of type floating-point, the result, <placeholder-3>, will also have 
type floating-point. A floating-point multiply operation is added to the com- 
piled program as line number 6. Execution resumes with multiply returning 
<placeholder-3> as its result: 

Returning: <placeholder-3> <»» (* placeholder- 1> <placeholder-l>) 

Returning: <placeholder-3> <»* (SQUARE placeholder- 1>) 

Entering: (SQUARE <placeholder-2>) 

Entering: (* <placeholder-2> <placeholder-2>) 

Again, the multiply operation must be delayed until run time. <placeholder- 
4> is created, and the multiply operation is added to the compiled program 
as line 7. Execution resumes with multiply returning <placeholder-4> as its 
result: 

Returning: <placeholder-4> <=« (* <placeholder-2> <placeholder-2>) 

Returning: <placeholder-4> <=* (square <placeholder-2>) 

Entering: (square 3.14) 

Entering: (* 3.14 3.14) ;This multiply takes 

Returning: 9.8596 <»■ (* 3.14 3.14) ; place at compile time 

Returning: 9.8596 <■* (square 3.14) 

Returning: (<placeholder-3> <placeholder-4> 9.8596) <»* (map ...) 

Entering: (+ <placeholder-3> <placeholder-4> 9.8596) 
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The addition operation is delayed until run time as line 8 in the compiled 
program. < placeholder- 5 > is created to represent the result: 

Returning: <placeholder-5> <=« (+ placeholder- 3> ...) 
Returning: <placeholder-5> <=« (apply + (map square ...)) 
Returning: <placeholder-5> <== (sum-of -squares ...) 

<Placeholder-5> represents the result of the computations that will occur 
at run time. In summary, the partial evaluation process is quite simple: The 
program executes normally, except that operations whose inputs are not 
available are delayed until run time by adding instructions to the compiled 
program. 
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2.2 Implementation 

Compile time interpretation is applicable to a variety of programming lan- 
guages. The prototype compiler and the examples in this report were im- 
plemented for the Scheme dialect of Lisp. Lisp is particularly amenable to 
these techniques in that all data abstraction mechanisms built-in to the lan- 
guage are type-independent. This allows placeholders to propagate through 
data structures that would normally have contained numbers. This approach 
would be difficult to implement in a strictly typed language such as Pascal. 

Any Lisp interpreter can be converted for use as a compile time inter- 
preter by adding support for the placeholder data-type. The lowest-level 
numerical primitives (those that depend on the actual value represented by a 
placeholder) must be modified 1 to check whether any of their arguments are 
placeholders. If any of the arguments are placeholders, the instruction adds 
itself to the list of instructions that must be executed at run time, and then 
returns a new placeholder to represent its result. Using placeholders to rep- 
resent intermediate results has proven to be a simple and elegant technique 
for performing partial evaluation. 



2.2.1 Interface to the Lisp System 

The compiled data-independent programs are incorporated into the Lisp sys- 
tem as primitive procedures. 2 The Lisp system provides the control struc- 
ture, executing calls to the optimized data-independent primitives as needed. 
In this way, the data-dependent portions of a program are handled via the 
powerful control mechanisms of Lisp, while the data-independent numerical 
computations are compiled into high-performance primitives. 

Whereas the input and output of the original program was via abstract 



x The ADVICE feature provided by MIT-SCHEME is a convenient mechanism for imple- 
menting this extension. Alternatively, variables such as + can be redefined as extended 
arithmetic procedures. 

2 Primitive procedures, such as +, are the lowest-level routines provided by the system. 
In the particular case of CScheme, these procedures are expressed in C and then added to 
the source code of the Scheme system itself. 
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Lisp data-structures, the input and output of the compiled primitives is via 
vectors of numerical values for the input and result placeholders. My compiler 
automatically generates a set of interface procedures to support the input 
and output data structures of the original Scheme program. These interface 
procedures extract values for the input placeholders from the run time data 
structures, and create result data structures that contain the values computed 
for the result placeholders. 

Programs that manipulate the input and output placeholder vectors di- 
rectly avoid the performance penalties associated with creating and refer- 
encing the interface data structures. To support this style of programming, 
it is desirable to arrange the ordering of the placeholder values within the 
input and output vectors such that the result vector produced by one prim- 
itive can be directly used as the input vector for the next primitive to be 
invoked. The prototype compiler automatically arranges for this ordering in 
situations where the input and output Lisp data structures have the same 
format. This has proven to be a particularly effective mechanism for creating 
high-performance iterative loops, in which the outputs of one iteration serve 
as the inputs for the next iteration. 



2.3 Code Generation 



The prototype compiler expresses the compiled program in the form of a 
low-level C program. Register allocation techniques are used to allocate run 
time storage for placeholder values. If a value is to be used more than once, 
it is stored in memory until it is no longer needed. However, if a placeholder 
value is only referenced once, the computation to produce that value is in- 
line coded directly at the place where the reference would have occured. This 
generates large numerical expressions, giving the C compiler an opportunity 
to efficiently utilize the hardware registers. 

Traditional compilation techniques are used to further improve the per- 
formance of the partially evaluated program. From the standpoint of the 
high-level language program, the partial evaluation process performs many 
optimizations, including common subexpression elimination of data refer- 
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ences and constant propagation. However, the elimination of compound 
data structures exposes the underlying numerical computation, presenting 
new opportunities for optimization. 

Symbolic manipulation is used to simplify the numerical expressions pro- 
duced by partial evaluation. For example, expressions such as (+ y (* -1 x)) 
are simplified into (- y x). Opportunities for simplifications often arise 
when high-level data structure operations are combined, as in the subtract- 
vectors operation shown in Figure 2.2. These optimizations are often not 
noticed by the programmer when they do not apply uniformly to all elements 
of a data structure, or when the operations being combined are located in 
logically separate portions of the program. 

(define (subtract-vectors a b) 
(add- vectors a 

(scale-vector -1 b))) 



Figure 2.2: Operations that are built by combining abstract operations can 
often be simplified through symbolic manipulation of the low-level computa- 
tion. In this version of subtract-vectors, the addition and scaling operations 
can be combined to form a subtraction. 

It is sometimes possible to determine the result of an operation even 
before all of the inputs are available. For example, (* x) can be easily 
reduced to 0, which may in turn allow the result of some other operation to 
be determined. Symbolic simplification increases the opportunities for this 
sort of optimization, particularly in situations in which can be derived as a 
result, as in (- x x). Constant folding is used to simplify instructions such 
as (* 5 x 6) into (* 30 x). 

The instruction simplification process sometimes eliminates all references 
to the value of a placeholder. For example, simplifying (- (FETCH 3) (FETCH 3)) 
might eliminate all references to the value of placeholder 3, allowing the com- 
putation of a value for placeholder 3 to be eliminated as well. Dead code 
elimination is used to eliminate those computations which are not needed to 
produce values for the result placeholders. In some cases, the programmer 
is able to declare that some of the result placeholders are not needed, allow- 
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2.4 Data-Dependent Control Flow 

Conditional branches present a problem for the compile-time interpreter: 
when the predicate of an IF statement evaluates to a placeholder, the inter- 
preter does not know whether to execute the consequent or the alternative. 
I refer to computations such as this as data-dependent because the flow of 
control can not be predicted at compile time. Data-dependent computations 
are generally composed of a set of data-independent regions, which are com- 
bined using control constructs that depend on intermediate results of the 
computation. The techniques described earlier in this Chapter provide a 
mechanism for identifying and optimizing the data-independent regions of a 
program. As mentioned earlier, one way to handle data-dependent programs 
is to compile only the data-independent regions, relying on the LISP system 
to determine the flow of control. 

In some situations it is not practical to use Lisp itself to control program 
execution. Some control functions can be moved into the compiled code by 
building mechanisms for handling data-dependent control transfers on top of 
the basic compile time interpreter. In contrast to the flexibility available in 
specifying data-independent routines, the mechanisms for specifying control 
flow are a limited subset of those available in Scheme. Data-dependent con- 
trol flow is a topic that deserves far deeper consideration, and is an excellent 
starting point for future work. However, the techniques described here are 
sufficient to represent most numerically-oriented scientific computations. 



2.4.1 Primitive Operations 

The simplest technique for handling control transfers is to hide them in prim- 
itive numerical operations. This is done for simple comparison operations, 
such as those associated with MIN, MAX, and ABS. Although these func- 
tions require a data-dependent control transfer to compute their result, they 
are viewed as simple numerical operations. At run time, these operations 
are supported either by hardware, or by high-performance subroutines. 3 At 



3 In fact, several modern ALU chips are providing operations such as MIN and ABS 
directly, thereby simplifying instruction sequencing. 
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compile time, when one of these primitive numerical operations encounters a 
placeholder, it returns a new placeholder to represent its result, just as any 
other low- level numerical operation does. 



2.4.2 Selection 

Mathematical functions are sometimes defined using conditional predicates 
to divide up the domain. For example, absolute value may be defined using 
conditional predicates as follows: 



. . . f x when x > 
abs{x) = | _ x 



when x < 



To allow the definition of these functions, it is necessary to introduce 
some form of control over the run time flow of program execution. The con- 
ditional control flow required to support division of the domain is relatively 
straightforward: Provided that the equation is not recursive, domain divi- 
sion can be viewed as selecting among various routines. In the prototype 
compiler, support for these selection operations is provided by augmenting 
the definition of the IF statement. (For the purposes of this document, this 
augmented version of IF is referred to as RUNTIME- IF). 

The predicate of a RUNTIME- IF statement is evaluated at compile time. 
When the predicate is not a placeholder, RUNTIME-IF behaves just as IF 
does, evaluating either the consequent or the alternative. However, when 
the predicate is a placeholder, both the consequent and the alternative are 
evaluated at compile time, generating compiled code for both possibilities. 
A conditional branch is generated to select at run time whether to execute 
the code associated with the consequent or the code associated with the 
alternative. 

The execution of both the consequent and the alternative at compile time 
places some restrictions on the structure of the program. Specifically, side- 
effects performed by both branches of the RUNTIME-IF will take place. Ideally, 
compilation techniques could be used to detect these side-effects and perform 
them at run time as well. For simplicity, I require that the programmer 
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declare that it is permissable to execute both branches at compile time. This 
is not a significant restriction since selection operations rarely perform side- 
effects. 



2.4.3 Iteration at Run Time 

The most important control paradigm in numerical programs is iteration. 
Many iterative loops can be executed at compile time, replacing the loop 
with the numerical operations it invokes. However, sometimes it is not prac- 
tical to evaluate loops at compile time. For example, when the number of 
iterations to be performed depends upon numerical results of the computa- 
tion, it is necessary to perform a conditional control transfer at run time. 
The compile time interpreter has been extended to provide support for iter- 
ative control transfers, relaxing the requirement that compiled programs be 
data-independent. 

Partial evaluation is used to specialize one iteration of the loop at compile 
time. The specialized version of the loop is then executed repeatedly. The 
ordinary procedure-call mechanism of Scheme can not be used directly to 
control this run-time looping, since procedure calls are eliminated by the 
partial evaluator at compile time. A new construct, called an entry-point, has 
been introduced to allow programs to perform control transfers at run time. 
As illustrated in Figure 2.3, defining an entry-point creates a continuation- 
like procedure, which causes a branch back to the entry- point when invoked. 

Several complications are introduced by the entry-point control mecha- 
nism. For example, if the structure of the data will be different for the next 
iteration through the loop, then the specialized code generated for the first 
iteration cannot be re-used to perform the next iteration. More sophisticated 
compilation techniques could help detect this situation. For now, I simply 
stipulate that by requesting partial evaluation of the body of a loop, the 
programmer is implicitly declaring that "the structure of the data present 
during the first iteration of the loop is identical to the structure that will be 
present during subsequent iterations." 
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(def ine-runtime-entry-point ((<var-namel> <initial-valuel>) 

(<var-name2> <initial-value2>) 

(lambda (return-to-entry-point) 

(let ((results (data-independent-computation))) 
(runtime-if (done-yet? results) 
results 
( ret urn- 1 o- ent ry-po int <new- var- value 1 > 

<new-var- value2> ...))))) 

Figure 2.3: Syntax of define-runtime-entry-point. An iterative entry-point is 
created in the run time instruction stream. Any time the return-to-entry- 
point procedure is called, the run time flow of program control returns to the 
start of the loop using the new variable values. 

Complex control mechanisms can be created by taking advantage of the 
fact that return-to-entry-point is a first-class Scheme procedure. For example, 
the EXPonent procedure may be denned to invoke a global *arithmetic-error* 
procedure whenever its argument is out of range. An adaptive integrator 
can rebind *arithmetic-error* to be a return-to-entry-point procedure that 
retries the computation using a different step size. 4 This mechanism allows 
other routines that use EXP to specify a different error behavior, thereby 
eliminating the need to have different exponent procedures for each error 
handling methodology, allowing EXP to become a general library function. 
Use of this technique is illustrated in Appendix B. 



4 The idea of using first-class return-to-entry-point procedures was motivated by the 
adaptive runge-kutta integrator in the Dynamicist's workbench, a large scientific software 
system being developed by the MIT Math and Computation Project. This integrator 
uses Scheme's call-with-current-continuation construct to implement the step-size changing 
technique. 
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2.4.4 Conditional Control Transfers: Implementation 
Details 

The variables specified in the entry-point variable list are used to pass in- 
formation from one iteration of the loop to the next. New placeholders are 
created to represent each of the loop variables. At run time, when the loop 
entry-point is first encountered, these variables are assigned their initial val- 
ues. Each time the loop is restarted, these variables take on new values as 
specified by the arguments to return-to-entry-point. 

Conditional transfers of control are implemented by combining the use of 
runtime-if with the entry-point control transfer mechanism. Implementing 
this combination is rather tricky: Runtime-if must execute both the con- 
sequent and the alternative in order to generate code for both branches. 
However, if the consequent invokes a return-to-entry-point procedure, the 
alternative would be bypassed. To avoid this problem, return-to-entry-point 
only causes a branch back to the entry-point at run time; at compile time, 
return-to-entry-point returns an abort tag to the nearest runtime-if, allow- 
ing the alternative to be executed. This form of conditional control transfer 
provides sufficient functionality to represent most loops that occur in numer- 
ical programs. In particular, the full functionality of the traditional WHILE 
and FOR/NEXT iteration constructs are supported. Figure 2.4 demonstrates 
the use of these mechanisms in the implementation of the iterative factorial 
function. 
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(define (Fac initial-n) 

(define-runtime-entry-point ((n initial-n) (result 1)) 
(lambda (fac-loop) 
(runtime-if (* n 0) 
result 
(fac-loop (- n 1) (* result n)))))) 

;;;Compiled Program for (fac <placeholder-i>) : 

(INPUT 1) ; placeholder- 1 represents the value of initial-n 

; Initialize the loop variables 

(ASSIGN 2 1) ;placeholder-2 is the loop-variable RESULT 

(ASSIGN 3 (FETCH 1)) ;placeholder-3 is the loop-variable N 

; The fac-loop entry point itself 

LABEL. 1: 

(ASSIGN 4 (» (FETCH 3) 0)) ; evaluate the runtime-if predicate 

(BRANCH (FETCH 4) CONSEQUENTS) ; select the consequent if predicate true 

ALTERNATIVE.!: 

(ASSIGN 5 (* (FETCH 2) (FETCH 3))) ;;;(* n result) 

(ASSIGN 6 (- (FETCH 2) i)) ; ; ; (- n 1) 

; Update loop variables for return to entry point 

(ASSIGN 2 (FETCH 5)) jresult 

(ASSIGN 3 (FETCH 6)) ;N 

(GOTO LABEL.l) ; return to entry point 

CONSEQUENT. 1 : 

(RESULT 2) ; the result is the value of placeholder 2 



Figure 2.4: An iterative implementation of the Factorial function using 
runtime-if and entry-points. Invoking the return-to- entry-point procedure 
FAC-LOOP causes a branch back to the entry point to appear in the compiled 
program. New placeholders are allocated for each loop variable, allowing 
arguments to be passed. 
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Chapter 3 
Applications 



I have applied these techniques to several numerically oriented scientific prob- 
lems. These problems were chosen from active research at MIT, providing a 
"real-world" demonstration of the applicability of partial evaluation to sci- 
entific computation. Scheme programs implementing the N-body algorithm, 
the solution to Duffing's equation, and the translation operator for the Mul- 
tipole Method[Zhao] were taken directly from code in use by researchers. In 
this chapter, I first describe how my compiler was used in each application, 
and then present performance measurements showing the performance gains 
attained by the compiler. 

I have implemented all of the partial evaluation techniques described in 
Chapter 2, including runtime-if and entry points. 1 In addition, the prototype 
compiler includes a constant folder, a dead-code eliminator, and a storage 
allocator. 2 The compiler generates low-level C programs which are incor- 
porated into the Scheme system as high-performance primitive operations. 

*Type propagation within placeholder values has not been fully implemented. For now, 
the values represented by placeholders are assumed to be of type 'floating-point. 

2 The dead-code eliminator and storage allocator used in the prototype compiler are 
restricted to purely data-independent programs. Eventually, I plan to merge these partial 
evaluation techniques into an existing Lisp compiler, thereby providing the full range of 
conventional compiler optimizations. 
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3.1 The N-body Problem 

3.1.1 Overview 

The N-body problem involves computing the trajectories of a collection of 
N particles which exert forces on each other. This very important problem 
arises in particle physics, astronomy, and space travel. For example, our solar 
system is a 10 particle system in which the forces are due to gravitational 
attraction. An N-body program written in Scheme by Gerry Sussman is used 
as a starting point for the compilation process. 3 This program, provided in 
Appendix C, makes liberal use of abstraction mechanisms, including higher- 
order procedures, lists, vectors, table lookups, and set operations. 

In order to simulate future particle motion, the program integrates the 
forces that the particles exert on each other over time. The integration-step 
routine takes an initial state of the planets, and produces a new state that 
corresponds to one time-step later. This routine is then repeated, thereby 
advancing the system forward in time. The prototype compiler was used to 
create a specialized version of the integration-step procedure. 

3.1.2 Describing the problem to the compiler 

The input data structures for the integration-step procedure were created 
at compile time. Since the positions and velocities of the planets are not 
known at compile time, these were represented using placeholders. The mass 
of each planet is known and does not vary appreciably with time, allowing 
numerical values for the masses to be included directly in the compile time 
data-structures. The compiler took advantage of the availability of these 
numerical values by performing some parts of the computation in advance. 
For example, since Pluto is very small relative to the other planets, its mass 
was approximated as zero in the compile time data-structures. The compiler 
propagated this piece of information throughout the program, eliminating 



3 A variant of this program was used in [Miller] to demonstrate parallel computing using 
futures. 
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numerous computations. 



3.1.3 Measurements 

Several measurements were taken to determine the effectiveness of the various 
code-generation optimizations. Tests were run for both the 6-body problem 
and the 9-body problem, 4 using both the runge-kutta (RK) and the Stormer 
(ST) integration methods. As shown in Figure 3.1, knowing the masses of the 
planets in advance allows the compiler to eliminate hundreds of instructions. 
This savings comes in part from performing portions of the computation in 
advance, and in part from taking advantage of the fact that Pluto's mass is 
zero. 



Variation in Number of Instructions 


Problem 
Description 


Mass 
Unknown 


Mass 
Known 


Savings 


6-body RK 
9-body RK 
6-body ST 
9-body ST 


2125 
4645 
1663 

2845 


2005 
4357 
1465 
2521 


120 (5.6%) 

288 (6.2%) 

198 (11.9%) 

324 (11.4%) 



Figure 3.1: Selecting the masses of the planets at compile time allows some 
computations to take place in advance, reducing the number of instructions 
that need to take place at run time. (The time-step was also chosen at 
compile time.) 

In Figure 3.2, I present measurements that were taken to determine the 
effects of the various post partial-evaluation optimizations. The results show 
that for this set of problems, traditional compiler optimizations had little 
effect. The only significant improvement was in the case of the runge-kutta 

4 In astronomy, the 6-body and 9-body problems are of particular interest. The 6-body 
problem is interesting because it includes only the outer planets and the sun, allowing 
questions of the long-term stability of the solar system to be investigated. The 9-body 
problem describes the motion of our solar system, excluding Mercury. Mercury is excluded 
because its high eccentricity necessitates the use of an extremely small integration step-size 
that makes long-term integrations impractical. 
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integrator with the time-step chosen at run time. This improvement might 
have resulted from the simplification process noticing that the derivative of 
the mass is always zero, allowing dead-code elimination to remove parts of the 
computation. However, this is purely supposition as I have not investigated 
the source of this improvement. 

The prototype compiler generates low-level programs expressed in the 
language C. Wherever possible, operations are in-line coded so as to allow the 
C compiler to make effective use of the processor registers. Storage cells are 
used to store results which will be re-used elsewhere in the program, as well 
as the inputs and outputs of the computation. Figure 3.3 shows the number 
of operations which were in-line codable, as well as the number of storage 
cells which were needed. The Stormer integrator used more storage cells 
than the the runge-kutta integrator because the Stormer integrator takes as 
input the derivatives obtained during the thirteen previous integration steps. 
Notice that for the runge-kutta integrator, register allocation techniques were 
sufficiently effective that the number of storage cells required is far less than 
the number of results that are actually stored in them. 



Effectiveness of Optimizations 


Problem 
Description 


Initial 
Operation Count 


Final 
Operation Count 


Savings due to 
Optimizations 


Time-step chosen at run time 


6-body RK 
9-body RK 
6-body ST 
9-body ST 


2206 
4732 
1480 
2542 


2011 
4363 
1477 
2539 


195 (8.4%) 

369 (7.8%) 

3 (0.2%) 

3 (0.1%) 


Time-step chosen at compile time 


6-body RK 
9-body RK 
6-body ST 
9-body ST 


2008 
4360 
1466 
2522 


2005 
4357 
1465 
2521 


3 (0.15%) 
3 (0.07%) 
1 (0.06%) 
1 (0.03%) 



Figure 3.2: Measuring the effectiveness of optimizations. RK denotes the 
runge-kutta integration method; ST denotes the stormer integration method. 
(The masses were chosen at compile time.) 
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Code Generation Statistics 



Problem 
Description 



Number of 
Operations 



# of In-line 
Codable opns 



# of re-used 
results 



# of Storage 
Cells Required 



Time-step chosen at run time 



6-body RK 
9-body RK 
6-body ST 
9-body ST 



2011 
4363 
1477 
2539 



1392 (69%) 
3057 (70%) 
1311 (89%) 
2196 (86%) 



619 (31%) 

1306 (30%) 

166 (11%) 

343 (14%) 



216 
369 
635 
973 



Time-step chosen at compile time 



6-body RK 
9-body RK 
6-body ST 
9-body ST 



2005 
4357 
1465 
2521 



1386 (69%) 
3051 (70%) 
1305 (89%) 
2187 (87%) 



619 (31%) 

1306 (30%) 

160 (11%) 

334 (13%) 



215 
368 
628 
965 



Figure 3.3: Code Generation Statistics for the N-body problem. 
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3.2 The Multipole Method Translation Op- 
erator 

3.2.1 Overview 

The multipole method is used to approximate force interactions involving 
a large number of particles. The method, as described in [Zhao], involves 
dividing space up into a quadtree-like tree of cubes. Part of the force ap- 
proximation involves propagating information up the tree from a cube to its 
parent. A significant portion of the computation time is spent evaluating 
translation operators. 

The prototype compiler was used to compile a specialized version of the 
translation operator. A Scheme implementation of this operation was taken 
from a program written primarily for people to understand. As such, the pro- 
gram does not take advantage of special cases in the multipole expansions, 
such as terms that are known to have exponents of zero or one. As shown in 
Figure 3.4, the compilation optimizations were able to detect these special 
cases, providing a significant performance improvement. The parameter P, 
which denotes the number of terms in the multipole expansions, was chosen 
at compile time. 5 In contrast to the N-body problem, post partial-evaluation 
optimizations provided significant improvements for the Translation Opera- 
tor. 



3.3 Duffing's Equation 

To demonstrate the compilation of data-dependent programs, an adaptive 
runge-kutta integrator was used to integrate a one period evolution of the 
variations and derivatives of Duffing's equation. This program was taken 
from Hal Abelson's work on automatic characterization of the state space of 

5 P = 3 is commonly used for benchmark purposes. For large P (above 10), the growth 
in code size makes compilation of the entire translation operator impractical. For large P, 
either a smaller segment could be compiled, or else some of the loops could be left intact. 
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Effectiveness of Optimizations: Translation Operator 


Problem 
Description 


Initial 
Operation Count 


Final 
Operation Count 


Savings due to 
Optimizations 


Time-step chosen at run time 


Translation, P=3 
Translation, P=6 


292 
3040 


114 
1698 


178 (60%) 
1342 (44%) 



Figure 3.4: Measuring the effectiveness of optimizations on the translation 
operator of the multipole approximation method. The compiler detected 
special cases such as exponentiation to the zero power, making it feasible 
for the program to be expressed in an elegant fashion that does not include 
checks for these special cases. 



Code Generation Statistics: Translation Operator 


Problem 
Description 


Number of 
Operations 


# of In-line 
Codable opns 


# of re-used 
results 


# of Storage 
Cells Required 


Translation, P=3 
Translation, P=6 


114 
1698 


90 (79%) 
1582 (93%) 


24 (21%) 
116 ( 7%) 


23 

115 



Figure 3.5: Code generation statistics for the Multipole Method translation 
operator. 

Duffing's equation 6 . The program, provided in Appendix B uses runtime-if 
and entry-points to express both the adaptive integrator and a control loop 
that iterates for one period. 

As I did not fully implement dead-code elimination and register allocation 
for data-dependent computations, only constant-folding and a few symbolic 
simplifications were performed. Partial evaluation produced 694 instructions, 
12 (1.7%) of which were eliminated by compiler optimizations. It is inter- 
esting to note that of the 682 remaining instructions, 639 (93.7%) form a 
data-independent region. The relatively large size of this data-independent 
region suggests that my focus on the optimization of data-independent re- 
gions is appropriate. 



5 Duffing's equation is presented in Appendix A 
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3.4 Performance Measurements 



The programs generated by the prototype compiler were integrated into 
the MIT Scheme system 7 as primitive operations. The chart in Figure 3.6 
presents performance measurements for the applications described above. 8 
For comparison, the table also shows the speed-up factor of these primitives 
relative to the original Scheme programs that they were generated from. The 
table clearly shows that specialization can provide dramatic performance 
improvements . 9 

It is also worth noting that the version of the Liar compiler which I was us- 
ing does not open-code floating-point computations. In some cases, it is pos- 
sible that implementation of this optimization could improve Liar's floating- 
point performance by as much as a factor of four. 10 However, performing this 
optimization would require the availability of type information, obtained ei- 
ther through type-inference or from declarations made by the programmer. 
This represents a fundamental difference between the two approaches: the 
conventional approach has the programmer put the declarations on variables 
in the program, imposing restrictions on the types of data that a routine can 
operate on, whereas the partial-evaluation approach allows general routines 
to be written, and then automatically specializes them based on declarations 
regarding the structure of the data for the particular problem at hand. 



Specifically, MIT CScheme release 7 with Liar compiler version 4.38, running on a 
Hewlett-Packard 9000 Series 350 with 16 Megabytes of memory. The timings presented 
do not include garbage collection time. 

8 Timings for Stormer integration of the N-body problem are not provided. A bug in 
the unix C compiler prevented the compilation of these C primitives. It is not clear why 
this problem arose, considering that the Stormer primitives are actually smaller than the 
primitives that use the runge-kutta integration method. The runge-kutta primitives had 
no trouble compiling. 

9 The Duffing's equation measurements for the specialized primitives are slow partially 
because in-line coding was not performed on the computation. 

10 It is possible that if Liar performed this optimization, the performance of the special- 
purpose primitives would also improve, as it would no longer be necessary to "unbox" the 
input floating-point numbers, or to "box" the outputs. 
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Performance Measurements 


Problem 
Desc. 


Interpreted 
CScheme 


Compiled 
CScheme 


Specialized 
Primitive 


Speed-Up over 
Interpreted 


Speed-up 
over Compiled 


6- Body RK 
9-Body RK 


1.7 
3.4 


0.76 
1.50 


0.020 
0.038 


85 
89 


38 
39 


Xlate P=3 
Xlate P=6 


0.26 
2.76 


0.022 
0.28 


0.002 
0.011 


130 
250 


11 
25 


Duffing 


26.1 


4.04 


0.53 


49 


7.6 



Figure 3.6: Timings of the sample applications. It is clear that the special- 
ized primitives are significantly faster than the Scheme programs they were 
generated from. For the N-body problem, both the time-step and the masses 
of the planets were chosen at compile time. Note that in-line coding was not 
performed on the Dufnng's equation program. 
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Chapter 4 

Parallel Computation 



Partial evaluation can play an important role in programming parallel com- 
puters. By eliminating inherently sequential data structure references and 
conditional branches, partial evaluation exposes the low-level parallelism in- 
herent in a computation. I have implemented several analysis and scheduling 
programs that detect and analyze this parallelism in data-independent com- 
putations. Without having to change the way that the algorithm is specified 
by the programmer, my programs have been able to detect virtually all of 
the parallelism available in the N-body problem. My programs also analyze 
some of the trade-offs involved in the design of a parallel architecture that 
can effectively utilize this parallelism. 

I demonstrate the effectiveness of these techniques by using the 9-body 
problem 1 as an example. I first present measurements of the theoretical 
maximum amount of parallel execution that can be attained on this problem, 
and then show how real-world design constraints such as pipelining and the 
cost of inter-processor communication limit the amount of parallelism that 
can be effectively exploited. 



Specifically, 12th-order Stormer integration of the 9-body gravitational attraction 
problem, with masses chosen at compile time, and time-step chosen at run time. 
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4.1 Parallelism Profiles 

Figure 4.1 presents a parallelism profile[Arvind] for Stormer integration of 
the 9-body problem. This profile describes the maximum amount of parallel 
execution that would occur if a computer had an infinite number of processors 
that could communicate with each other instantaneously. The profile was 
produced by re-ordering the numerical operations produced by the partial 
evaluator, such that all computations that can occur in parallel are grouped 
together into a single machine cycle. Without partial evaluation, sequential 
data-structure references would be intermixed with the numerical operations, 
hiding some of the parallelism. 

This profile differs from the parallelism profiles that commonly appear 
in the literature in that it accounts for the fact that some operations re- 
quire more cycles to complete than others. 2 To illustrate this distinction, a 
conventional parallelism profile that ignores the latency differences between 
numerical operations is presented in Figure 4.2. I have found that for com- 
putations involving double-precision floating-point operations, the latency 
differences between numerical operations are large enough to be of funda- 
mental importance. 



2 I derived relative timings for the various arithmetic operations from the latencies of the 
B3110A/B3120A floating-point chips manufactured by Bipolar Integrated Technologies. 
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Figure 4.1: Parallelism profile of the 9- body problem. This graph represents 
the total parallelism available in the problem, accounting for the latency of 
numerical operations. 
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Figure 4.2: Conventional parallelism profile of the 9-body problem. This 
graph does not account for the varying latency of numerical operations. It is 
clear from comparison with Figure 4.1 that the latency of individual numer- 
ical operations is an important factor that must be accounted for. 
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4.2 Practical Considerations 



4.2.1 Efficiency 

For the 9-body problem, fully utilizing parallelism in the fashion suggested 
by the parallelism profile would require 865 processors. Using only one pro- 
cessor, this computation requires 3094 cycles, whereas using 865 processors, 
the computation requires only 32 cycles. Thus in the ideal world where 
communication is free, parallelism improves performance by a factor of 97. 
However, in the 32 cycles it took to perform the 9-body computation, those 
same 865 processors could have performed 27,680 computations. In other 
words, these processors were used with only about 11% efficiency. From a 
cost vs. performance standpoint, this approach does not make much sense. 

A more efficient approach is to distribute the parallelism evenly among 
the machine cycles. For example, in the parallelism profile of the 9-body 
problem, there is a great deal of parallelism available in the first cycle, but 
not very much in some of the later cycles. The parallelism can be distributed 
more uniformly by moving some of the operations from the first cycle into 
later cycles. As long as these delayed operations are performed before their 
results are actually needed, the program will execute in the same amount 
of time, but will require fewer processors. In the next section, I describe a 
program that I have implemented to perform this parallelism redistribution. 



4.2.2 Pipelining 

Technological considerations often lead to overlapping the execution of suc- 
cessive instructions within a single processor. The parallelism profile analysis 
presented above was based on the assumption that the result of an instruction 
that finishes executing in one cycle can be used immediately in the following 
cycle. Unfortunately, this assumption is not valid in the presence of pipelin- 
ing. Figure 4.3 shows that for a 3-stage pipeline, the result of an instruction 
which is initiated in cycle 1 will not be available to the instruction that is 
initiated during cycle 2. Thus even with an infinite number of processors and 
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no communication delays, a machine composed of 3-stage pipelined proces- 
sors will require about twice as many cycles to execute a computation as a 
non-pipelined machine would. 3 



Instruction 1 
Instruction 2 
Instruction 3 



| Cyclt 1 CycLt 2 | Cyclt 3 | Cycl* 4 | Cyclt S | 






Load Execute Unload 




\ 




Load Execute] 


Unload 


/ 


Load 


Execute Unload 



TIME 

Figure 4.3: A typical 3-stage processor pipeline. During the LOAD stage, the 
data is loaded into the ALU. The result is computed during the EXECUTE stage, 
and unloaded from the ALU during the UNLOAD stage. The results produced by 
instruction 1 are not available to be used by instruction 2, but are available 
to instruction 3. 

Despite this increase in the number of cycles required to execute a pro- 
gram, pipelining is advantageous because it can make it possible to reduce 
the length of each cycle. In addition, it is possible to use some of the paral- 
lelism available in the problem to keep processors from falling idle. In other 
words, rather than scheduling all available parallel operations into the same 
cycle on many processors (parallelism in spacefWU 87]), it is possible to use 
a smaller number of processors, and schedule some of the operations during 
the next cycle (parallelism in time) in order to keep the pipeline busy. This 
utilizes the individual processors more effectively. 



3 Since some instructions require more than one EXECUTE cycle to compute their result, 
the processors will sometimes be busy more than half the time. My "twice as many cycles" 
estimate assumes that operations require one EXECUTE cycle to complete. 

On the other hand, the estimate also does not account for the fact that a result must 
first be unloaded from a processor before it can be loaded into another one. This creates a 
one cycle cost to moving data between processors, even when there are no communication 
delays. This effectively increases the minimum number of cycles required to complete the 
computation. Overall, these two effects tend to cancel each other out. 
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4.2.3 Communication Latency 

In practice, processors can not communicate instantaneously. The time re- 
quired to move a result from one processor to another limits how soon the 
result can be used by a subsequent instruction. This has an effect that is 
similar to increasing the length of the pipeline, as illustrated in Figure 4.4. 
To counter this effect, some of the parallelism available in the problem can 
be used to keep the processors busy while they are waiting to receive the 
results of previous computations. 



Cycl« 1 Cyclt 2 Cyclt 3 Cyclt 4 Cyclt S Cyclt 6 Cyclt 7 Cyclt 8 Cyclt 9 | Cyclt 10 



Instruction 1 

Instruction 2 

Instruction 3 

Instruction * 

Instruction 5 

Instruction & 



Load Execute! Unload 



Load ; Execut 



Comm-1 



Unload 




Load JExecute 


Unload 


Comnv 


1 


Comm-2 










Load 


Execute | Unlos 


d 


Comm-1 | Comm-2 



p 



Load iExec/tel Unload 



Comm-1 



Comm-2 i 



Load ;Executei Unload 



Comm-1 ! Comm-2 



TIME 



Figure 4.4: A 3-stage processor pipeline with a communication latency of 
two cycles. As indicated by the arrows, a result produced by instruction 1 
can be used within the same processor by instruction 3, but can not be used 
by other processors until instruction 6. 
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4.3 A Scheduler for Parallel Programs 

I have implemented a program that schedules parallel operations onto mul- 
tiple processors. Accounting for the effects of pipelining and communication 
latency, my program increases processor utilization by delaying operations 
which are not in the critical path of the computation. In other words, if the 
result of a computation is not needed right away, that computation is delayed 
until a processor is free to handle it. 



4.3.1 Performance Measurements 

Figure 4.5 shows the results of applying my scheduling program to the 9-body 
problem, for a 40 processor system with a 3-stage processor pipeline and a 
communication latency of one cycle. The figure shows how the parallelism 
available in the problem has been distributed over the life of the computation, 
so as to effectively utilize all 40 processors in most of the cycles. Overall, 
the performance improvement over that of a single pipelined processor was 
a factor of 36, indicating that the processors were used with approximately 
90% efficiency. 

The ability of the scheduler to effectively utilize the available processors 
varies with both the number of processors in the system and the communi- 
cation latency. For the 9-body problem, these variations are summarized by 
Figures 4.6 and 4.7. These graphs clearly show that communication latency 
has a direct effect on the maximum speed-up that the scheduler can provide. 
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Figure 4.5: The result of scheduling the 9-body problem onto 40 pipelined 
processors with a communication latency of one cycle. A total of 85 cycles are 
required to complete the computation. On average, 36.4 of the 40 processors 
are utilized during each cycle. 



42 



^ 50 



45 




















/ 
/ 
/ 
/ 


40 


















j 

/ 
/ 
/ 
/ 


Conuxi • 
_Latenc' ' 


35 
















/ 
/ 
/ 




1 Cyclis 
















/ 
/ 

/ it 


r La 


tency 2 


Cycles 


$0 


















/ 




La 


tency 3 


Cycles 


25 






20 






















15 






















10 






















5 












































5 10 15 20253035404550 

Number of Processors 



Figure 4.6: Scheduling of the 9-body problem: The graph shows the speed-up 
factor over a single pipelined processor. The analysis shown is for a system 
composed of processors employing a 3-stage pipeline. 
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Figure 4.7: Scheduling of the 9-body problem: The graph shows how proces- 
sor utilization efficiency varies with the number of processors in the system 
and with communication latency. Efficiency is denned as the ratio of speed- 
up factor to the total number of processors. The analysis shown is for a 
system composed of processors employing a 3-stage pipeline. 
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4.3.2 Implementation Details 

The scheduler operates by manipulating a data-flow graph. It starts by 
computing the latency of every possible path through the graph. These 
paths are then sorted, allowing the critical-path of the computation to be 
identified. When the operations are scheduled, priority is given to those 
operations that lie in the critical-path of the computation. If all available 
processors are not needed to work on the most critical-path, computations 
from less critical paths are scheduled. 

The problem of scheduling every operation onto the "best processor" 
at the "best time" is extremely difficult. Rather than trying to find an 
optimal solution to the problem, I developed a heuristic designed to select a 
"pretty good" solution. To give a flavor for the algorithm, a brief overview 
is presented below: 

• A set of operations is chosen corresponding to the number of processors 
that are available. This selection is based on the latency priorities 
described above. 

• Among the "chosen operations" , those whose operands have been avail- 
able long enough to have been transmitted to other processors have 
lower scheduling priority than those operations whose operands have 
been produced recently. This gives priority to non-relocatable compu- 
tations. 

• A computation whose operands were produced by a processor will be 
scheduled in that same processor wherever possible. 

• The number of connections between processors is kept to a minimum. 
When the operands of a computation must be transmitted from one 
processor to another, the scheduler attempts to choose a pair of pro- 
cessors that have communicated with each other in the past. 

• Several heuristics exist for breaking ties. These take into account such 
factors as the memory usage within each processor, the number of com- 
putations that are waiting for a particular result, and the frequency 
with which processors use the communication network. 
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4.3.3 Does the scheduler do a good job? 

In section 4.2.2 I showed that even with no communication latency, a collec- 
tion of 3-stage pipelined processors can at best expect to complete a compu- 
tation in approximately twice the number of cycles shown in the parallelism 
profile. In the case of the 9-body problem, the parallelism profile has 32 
cycles, so in a world of zero-cost communication and an infinite number of 
processors, we can expect the program to complete in 64 cycles, plus 2 cycles 
for loading and unloading the pipeline at the start and end of the computa- 
tion. 

For comparison purposes, I ran the scheduler on the 9-body problem for 
1000 processors with a communication latency of zero cycles. The resulting 
program required 66 cycles, which agrees exactly with my estimate of the 
maximum possible performance. Although this represents an overall speed- 
up over a single pipelined processor of a factor of 46, it is rather inefficient 
to use hundreds of processors to attain this improvement. 

In the real world, it takes time to communicate, and this can be expected 
to decrease the maximum attainable speed-up. In the zero communication 
cost situation, 1339 values were moved among the processors, 378 of which 
were used immediately. In other words, the zero- cost communication was 
taken advantage of in about 28% of the cycles. It is reasonable to expect that 
increasing the communication latency will decrease the maximum attainable 
speed-up below a factor of 46 by slowing down the arrival of messages in 
the "critical-path" of the computation. For a communication latency of one 
cycle, the scheduler is still able to provide a speed-up factor of 37, using 
only 40 processors. Given that the theoretical maximum for this problem is 
somewhere below a factor of 46, I consider the attained speed-up factor of 
37 to be a quite good, especially considering that this speed-up was attained 
while making good use of the processors (90% efficiency). 
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4.4 Routing Considerations 

The scheduler places computations in processors with a view towards min- 
imizing the routing requirements. However, unlike conventional parallel 
schedulers, it does not make any assumptions about the topology of the 
routing network. Although a processor transmits at most one result per 
processor per cycle, this result can have multiple destinations. After the 
placement has been performed, the required communication can be mapped 
onto an existing routing network, making local changes in the placement as 
needed. 

For small numbers of processors (up to about 16), it is feasible to fully in- 
terconnect the processors, allowing all communication specified by the sched- 
uler to be implemented directly with a latency of only one cycle. Pipelined 
multi-stage networks can also be used, but as shown in the speed-up graphs, 
the latency of these networks is a critical factor limiting the maximum speed- 
up that can be attained. 



4.4.1 Synthesizing a Routing Network 

The 9-body problem is of sufficient importance to justify constructing special- 
purpose computers to solve it. One such computer was in fact constructed 
several years ago in a joint Caltech/MIT project [Applegate]. Some prob- 
lems of current interest to astronomers require more computation power than 
is provided by the existing orrery, leading to the consideration of new ap- 
proaches. 

I propose that a low-latency special-purpose routing network can be con- 
structed to match the routing specifications derived by the scheduling pro- 
gram. A full interconnection of N processors requires N 2 switch-points. Since 
the placement algorithm attempts to minimize the number of destinations 
for each processor, it is advantageous to delete those switch-points that cor- 
respond to communication paths that will never be used. Figure 4.8 summa- 
rizes the number of switch-points that would be required to implement the 
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Figure 4.8: Number of switch-points required to implement a special-purpose 
routing network for the 9-body problem. As the number of processors in- 
creases, specialization provides a dramatic reduction in routing network size. 
The dashed line indicates the number of switch-points in a fully intercon- 
nected network, while the three solid lines indicate that the number of switch- 
points is relatively independent of routing latency. 
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4.5 Summary 

I have shown that partial evaluation can be used to eliminate the barriers to 
parallel execution, exposing the low-level parallelism inherent in a computa- 
tion. For the 9-body problem, I have shown that this parallelism can in fact 
provide significant performance improvements, and have presented a strat- 
egy for effectively utilizing this parallelism by scheduling the computation 
onto a collection of pipelined processors. I have also proposed a cost-efficient 
technique for constructing high-performance special- purpose computers. The 
combination of these contributions comprises a strategy for automatically 
generating and programming a special-purpose computer based on a high- 
level language specification. 



4.5.1 Relation to Previous Work 

Many compilers for high-performance architectures use program transforma- 
tions to exploit low-level parallelism. For instance, compilers for vector ma- 
chines unroll loops to help fill vector registers. [Padua] Similarly, compilers for 
VLIW architectures use trace-scheduling[E\lis] to guess which way a branch 
will go, allowing computations beyond the branch to occur in parallel with 
those that precede the branch. The effectiveness of both of these techniques 
is limited by their preservation of the user data-structures of the original 
program. In other words, if the original program represented an object as a 
vector of vectors, the compiled program will do so as well. This preservation 
of user data-structures imposes synchronization requirements which reduce 
the low-level parallelism available to the compiler. 

My work focuses on parallelizing those regions of a program which it is 
feasible to evaluate symbolically at compile-time. This compile-time eval- 
uation allows user data-structures and many conditionals to be eliminated, 
producing purely numerical programs that contain extremely large basic- 
blocks. This approach makes it possible to use intermediate results im- 
mediately in portions of a program that would not otherwise have been 
reached even through trace- scheduling, due to conditional tests related to 
user data-structures. This technique is orthogonal to the trace-scheduling 
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approach: symbolic-evaluation can be used to eliminate conditional tests re- 
lated to data-structures, producing large parallelizable basic-blocks, while 
trace-scheduling can be used to optimize across basic-block boundaries. 



4.5.2 Suggestions for Future Work 

The technique of using symbolic-evaluation to expose parallelism within data- 
independent regions may be combined with other parallel-programming ap- 
proaches. For example, this technique can be combined with the futures 
approach of MultiScheme [Miller] by using symbolic evaluation to parallelize 
computations within a future, thereby allowing futures to be used to pro- 
gram a collection of parallel computers. Similarly, this technique may be 
used in conjunction with hardware that does dynamic scheduling of data-flow 
graphs: symbolic-evaluation can be used to create large statically-analyzable 
nodes within a dynamic data-flow graph. Rather than connecting a collec- 
tion of relatively simple processors, the parallelism available within each of 
these statically-analyzable nodes makes it feasible to use dynamic-scheduling 
hardware to combine a collection of more powerful (parallel) processors. 
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Chapter 5 



Conclusions 



Partial evaluation has an important role to play in the compilation of numerically- 
oriented scientific computations. When sufficient information is available at 
compile time, the specialization and data abstraction elimination capabili- 
ties of partial evaluation provide significant performance improvements over 
conventional compilation techniques. These improvements leave the pro- 
grammer free to write abstract, elegant programs that express ideas, leaving 
the optimization of special cases as a task for the compiler. 

Using placeholders to propagate intermediate results has proven to be 
a simple and elegant technique for performing partial evaluation. The pro- 
totype compiler I implemented does a fairly good job of optimizing data- 
independent regions. A good starting point for future work would be to im- 
prove the techniques for combining data-independent regions to form data- 
dependent programs. For example, by using lexical analysis techniques, a 
compiler could identify the potential side-effects and lexical references of a 
program. Those data structures which are side-effected could be created 
and referenced at run time. Placeholders could then be used to represent 
run-time data structures as well as numerical values. 

Partial evaluation exposes the low-level parallelism inherent in a compu- 
tation. I have implemented a scheduler that takes advantage of this paral- 
lelism to make efficient use of a collection of pipelined processors. I have 
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also shown that communication latency is a critical factor that limits the 
maximum speedup that can be attained through parallel execution. I have 
shown that a special- purpose routing network can efficiently provide the low- 
latency communication required for a particular problem at a reasonable cost. 
A good starting point for future work would be to investigate routing net- 
works that provide for a mixture of low and high latency communication. 
For example, by exploiting locality, it is possible for a network to provide 
some single cycle latency communication paths, together with some multiple 
cycle paths for long-distance messages. 

The combination of the performance gains provided by parallel execution 
with the order-of-magnitude performance improvement provided by program 
specialization promises a fundamental improvement in the performance of an 
important class of abstractly specified numerical programs. I believe that a 
production quality compiler based on the techniques that I have described 
would lead to a fundamental change in the way that scientists write programs. 
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Appendix A 

The Dufflng's Equation 
Program 



; ; ;UEDA-VAR-DER-K is the function being integrated in the examples 
; ; ; in Chapter 3 . 

(define (ueda-var-der-k b k) 
(let ((k- (- k))) 
(lambda (state) 

(let ((t (vector-ref state 0)) 
(x (vector-ref state 1)) 
(u (vector-ref state 2)) 
(del/x/x (vector-ref state 3)) 
(del/u/x (vector-ref state 4)) 
(del/x/u (vector-ref state 5)) 
(del/u/u (vector-ref state 6)) 
(dx/dk (vector-ref state 7)) 
(du/dk (vector-ref state 8))) 
(let ((gl (* -3 x x))) 
(vector 
1 
u 
(- (* B (cos t)) (* k u) (* x x x)) 
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;;; The input data structure for the Duffing 's Equation example: 

(define varder-k-state-vector 
(vector (make-placeholder 't) 
(make-placeholder 'x) 
(make-placeholder 'u) 
(make-placeholder 'del/x/x) 
(make-placeholder 'del/u/x) 
(make-placeholder ' del/x/u) 
(make-placeholder 'del/u/u) 
(make-placeholder 'dx/dk) 
(make-placeholder Mu/dk))) 

; ; ; The actual sequence of code that was compiled: 

(end-state (ueda-var-der-k b k) 

varder-k-state-vector) 
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Appendix B 

Quality Controlled 
Runge-Kutta Integrator 



; ; ;The following code is an example of the use of runtime-if and 
; ; ; entry-points to implement an adaptive runge-kutta integrator . 

; ; ; It was derived from an integrator written by Hal Abelson and 
;;;Gerry Sussman as part of the dynamicists workbench. 

; ; ;The library of vector operations is omitted. 



;;;0ne runge-kutta integration step 
(define (rkstep der dydx state h) 
(let* ((h* (scale-vector h)) 
(kO (h* dydx)) 

(kl (h* (der (add-vectors state (1/2* kO))))) 
(k2 (h* (der (add-vectors state (1/2* kl))))) 
(k3 (h* (der (add-vectors state k2))))) 
(add-vectors state 

(1/6* (add-vectors (add-vectors kO (2* kl)) 

(add-vectors (2* k2) k3)))))) 
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(define (quality-control stepper order) 
(let ((2*order (expt 2 order)) 

(error-scale (/ -1 (+ order 1)))) 
(let ((half weight (scale-vector (/ 2"order (- 2"order 1)))) 
(fullweight (scale-vector (/ 1 (- 2~order 1))))) 
(lambda (der) 

(lambda (state h-init continue) 
(let ((dydx (der state))) 
(define (loop h) 

(DEFINE-ITERATIVE-ENTRY-POINT ((h h)) 
(lambda (return-to-loop) 

(fluid-let (((access arithmetic-error circuit -package) 

(lambda () ;;;IF OVERFLOW, TRY SMALLER STEPSIZE 
(RETURN-TO-LOOP (/ h 4))))) 
(let* ((h/2 (/ h 2)) 

(fullstep (stepper der dydx state h)) 

(half step (stepper der dydx state h/2)) 

(2half steps (stepper der (der half step) half step h/2)) 

(diff (sub-vectors 2half steps fullstep)) 

(err (/ (maxnorm 

((elementwise (lambda (y d) 
(/ d 

(+ qc- error- st op 
(abs y))))) 
2half steps 
diff)) 
qc-allowable-error) ) ) 
(RUNTIME- IF 
(> err 1) 
(RETURN-TO-LOOP 

(* qc-safety h (expt err error-scale))) 
(let ((newh (* qc-safety h (expt err error-scale))) 
(new-state 
(sub-vectors (halfweight 2halfsteps) 
(fullweight fullstep)))) 
(continue new-state newh))) 

))))) 
(loop h-init))))))) 
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(define (integrate-f or- interval system-derivative 

initial-state 
interval 
h 

. integ) 
(let ((integrator (if (null? integ) default-integrator (car integ)))) 
(let ((integration-step (integrator system-derivative)) 

(final-time (+ interval (state-vector-time initial-state)))) 
(DEFINE-ITERATIVE-ENTRY-POINT ((h-left interval) 

(h-next h) 

(state initial-state)) 
(lambda (STEPLOOP) 

(RUNTIME-IF (<• h-left 0) 
state 

(integration-step 
state 

(min h-next h-left) 
(lambda (next-state next-h) 
(steploop (- final-time 

(state-vector-time next-state)) 
(RUNTIME-IF « next-h h) 
next-h 
h) 
next-state))))))))) 
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Appendix C 

The N-body Program 



; ; ; This program solves the N-body Problem using the gravitation force 

;;; law with a runge-kutta-4 integrator. It was written by G.J. Sussman. 

;;; Code for the stormer integrator is not shown. 

This program was originally written to allow execution on 
parallel processors - no assumptions are made as to the ordering 
of elements within a list. It has become a fairly standard benchmark 
in the local community. 
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; Integration of orbits 

INTEGRATE-SYSTEM takes an initial system state and a time step, H. 
it produces a stream of future states. A system state is a time, 
and a set of particles. Each particle has a name, a mass, a position 
and a velocity. 

(define (integrate-system initial-state h) 

(let ((integrator (runge-kutta-4 (particle-force gravitation) h))) 
(define (next state) 
(cons state 

(delay (next (integrator state))))) 
(next initial-state))) 

(define (integration-step initial-state h) 

(let ((integrator (runge-kutta-4 (particle-force gravitation) h))) 
(integrator initial-state))) 

Runge-Kutta takes a function, F, that produces a system derivative 
from a system state. Runge-Kutta produces a function that takes a 
system state and produces a new system state. 

(define (runge-kutta-4 f h) 
(define h* (scale-system h)) 
(define 2* (scale-system 2)) 
(define 1/2* (scale-system (/ 1 2))) 
(define 1/6* (scale-system (/ 1 6))) 
(lambda (y) ; y is a system state 

(define kO (h* (f y))) 

(define kl (h* (f (add-systems y (1/2* kO))))) 

(define k2 (h* (f (add-systems y (1/2* kl))))) 

(define k3 (h* (f (add-systems y k2)))) 

(add-systems y (1/6* (add-systems kO (2* kl) (2* k2) k3))))) 
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Given a system state PARTICLE-FORCE produces a system derivative. 
A system derivative is a time increment and a set of differential 
particles. A differential particle has a name, a differential of 
mass, a differential of position and a differential of velocity. 

(define (particle-force force-law) 
(define (accelerations bodies) 

(let ((p (car bodies)) (r (cdr bodies))) 

(if (null? (cdr r)) ;2-body problem 

(let ((inc (force-law p (car r)))) 

(list (car inc) (cdr inc))) 
(let ((incs 

(map (lambda (other) 

(force-law p other)) 
r))) 
(cons (reduce add (map car incs)) 
(map add 

(map cdr incs) 
(accelerations r) )))))) 
(lambda (system-state) 

(make-system 1 ; dt/dt 

(map (lambda (p a) 

(make-particle (name p) 

; dm/dt 

(velocity p) 
a)) 
(particles system-state) 
(accelerations (particles system-state)))))) 

(define (gravitation pi p2) 

(let ((dx (sub (position pi) (position p2)))) 

(let ((rcube (cube (sqrt (reduce + (map square dx)))))) 
(cons (scalar (/ (* -G (mass p2)) rcube) dx) 

(scalar (/ (* G (mass pi)) rcube) dx))))) 
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; ; ; ; Orbital mechanics data structures 



(define type car) 

A system state is a time and a set of particles. 
Each particle has a name, a mass, and a dynamical description 
in either rectangular or element coordinates. 

(define (make-system time particles) 

(list '(system-state) time particles)) 
(define time cadr) 
(define particles caddr) 

(define (operate-system f system) 

(make-system (time system) (map f (particles system)))) 

(define (operate-planets f system) 
(operate-system (lambda (s) 

(if (eq? (name s) 'sun) s (f s))) 
system) ) 

(define (find-particle n system) 

(lookup (lambda (p) (eq? (name p) n)) (particles system))) 

(define (scale-system scale-factor) 

(define sp (scale-particle scale-factor)) 
(lambda (system) 

(make-system (* scale-factor (time system)) 
(map sp (particles system))))) 

(define (add-systems . systems) 

(make-system (reduce + (map time systems)) 

(map (lambda (bunch) (reduce add-particles bunch)) 
(group name (map particles systems))))) 
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; ; ; Particles 

(define (scale-particle factor) 
(define factor* (scale factor)) 
(lambda (particle) 

(assert (memq 'rectangular (type particle)) "Not rectangular") 
(list (type particle) 
(name particle) 
(* factor (mass particle)) 
(factor* (position particle)) 
(factor* (velocity particle))))) 

(define ( add-part icles . particles) 
(for-each (lambda (p) 

(assert (memq 'rectangular (type p)) "Not rectangular")) 
particles) 
(list (type (car particles)) 
(name (car particles)) 
(reduce + (map mass particles)) 
(reduce add (map position particles)) 
(reduce add (map velocity particles)))) 
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; ; ; ; Math library 

; ; ; This is a standard math library - not all of the functions are 
; ; ; used by the N-body program. 

(define (acos w) 

(define epsilon 1.0e-13) 
(cond ((< (abs w) 1) 

(atan (sqrt (- 1 (square w))) w)) 
((> (- (abs w) 1) epsilon) 

(error "ACOS — argument > 1: " w)) 
((>- w 1) 0) 
(else pi))) 

; ; ; Functional operators 

(define ((bracket . fl) . x) 
(map (lambda (f) (apply f x)) 
fl)) 

(define ((apply-to-all f) x) 
(map f x)) 

(define (reduce proc args) 
(if (null? (cdr args)) 
(car args) 
(proc (car args) 

(reduce proc 

(cdr args))))) 
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(define (reduction f make-f-identity) 
(define (reduce 1) 

(cond ((null? 1) (make-f-identity)) 
((* (length 1) 1) (car 1)) 
(else (f (car 1) (reduce (cdr 1)))))) 
(lambda 1 
(reduce 1))) 

(define ((compose . fl) x) 
(if (null? fl) 
x 
((car fl) ((apply compose (cdr fl)) x)))) 

(define (fixed-point f start) 
(define epsilon 5.0e-13) 
(define (good- enough? new start) 
(< (/ (abs (- new start)) 

(max (abs new) (abs start) epsilon)) 
epsilon)) 
(let loop ((old start) (new (f start))) 
(if (good-enough? new old) 
new 
(loop new (f new))))) 

(define (for-each-distinct-pair proc list) 
(if list 

(let loop ((first (car list)) (rest (cdr list))) 
(for-each (lambda (other-element) 

(proc first other-element)) 
rest) 
(if rest (loop (car rest) (cdr rest)))))) 
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; ; ; Matrices 

(define make-matrix list) 

(define (matrix*vector m v) 

(map (lambda (v2) (dot v v2)) m)) 

(define (transpose 11) 
(apply map list 11)) 

(define (2-matrix-multiply ml m2) 
(let ((t (transpose m2))) 

(map (lambda (vl) ; (matrix*vector t vl) 

(map (lambda (v2) (dot vl v2)) t)) 
ml))) 

(define (matrix-multiply . m) 
(reduce 2-matrix-multiply m)) 

(define (2-rotation angle) 

(let ((s (sin angle)) (c (cos angle))) 
(make-matrix (3-vector c (- s)) 

(3-vector s c)))) 



; ; ; Geometry 

(define (make-x-rotation angle) 

'((1 0) 

(0 ,(cos angle) ,(- (sin angle))) 

(0 ,(sin angle) ,(cos angle)))) 

(define (make-y-rotation angle) 

'((,(cos angle) ,(sin angle)) 

(0 1 0) 

(,(- (sin angle)) ,(cos angle)))) 

(define (make-z-rotation angle) 
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'((,(cos angle) ,(- (sin angle)) 0) 
(,(sin angle) ,(cos angle) 0) 
(0 1))) 

; ; ; Angles 

(define pi (* 4 (atan 11))) 
(define 2pi (* 2 pi)) 
(define pi/2 (/ pi 2)) 

(define radians-per-arc-second 
(/ 2pi 60 60 360)) 

(define radians -per-degree 
(/ 2pi 360)) 

(define (principal-value angle) 
(let ((na (/ angle 2pi))) 
(* 2pi (- na (round na))))) 

; ; ; Vectors 

(define (square x) (* x x)) 
(define (cube x) (* x x x)) 

(define 3-vector list) 
(define x-coord car) 
(define y-coord cadr) 
(define z-coord caddr) 

(define (map-vector f) 
(define (mapv . vl) 
(if (pair? (car vl)) 
(apply map mapv vl) 
(apply f vl))) 
mapv) 



(define zero-vector (3-vector 0)) 
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(define add (map-vector +)) 
(define sub (map-vector -)) 
(define (scale c) (map-vector (lambda (x) (* c x)))) 

(define (scalar c v) 
((scale c) v)) 

(define (magnitude v) 

(sqrt (apply + (map square v)))) 

(define (dot vl v2) 

(apply + (map * vl v2))) 

(define (cross vl v2) 
(3-vector 
(- (* (y-coord vl) (z-coord v2)) (* (y-coord v2) (z-coord vl))) 
(- (* (z-coord vl) (x-coord v2)) (* (z-coord v2) (x-coord vl))) 
(- (* (x-coord vl) (y-coord v2)) (* (x-coord v2) (y-coord vl))))) 

(define (vector-angle vl v2) 

(let ((epsilon 1.0e-13) (dp (dot vl v2)) (tl (cross vl v2))) 
(let ((m (magnitude tl))) 

(cond ((and (< m epsilon) (< dp epsilon)) 

(error "Vector angle is numerically ill-defined." vl v2)) 
(else (atan m dp)))))) 

(define (signed- vector-angle vl v2 z) 

;; this computes the signed angle from vl to v2. 
(let ((epsilon 1.0e-13) (dp (dot vl v2)) (tl (cross vl v2))) 
(let ((m (magnitude tl))) 

(cond ((and (< m epsilon) (< dp epsilon)) 

(error "Vector angle is numerically ill-defined." vl v2 z)) 
((< (dot z tl) 0) (- (atan m dp))) 
(else (atan m dp)))))) 
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; ; ; Stream procedures . 

(define head car) 
(define (tail s) (force (cdr s))) 
(define empty-stream? null?) 
(define the-empty- stream '()) 

(define (apply-to-stream f . streams) 
(let loop ((s streams)) 

(if (not (apply *or (map empty-stream? s))) 
(sequence 

(apply f (map head s)) 
(loop (map tail s)))))) 

(define (map-stream f . streams) 
(let loop ((s streams)) 

(if (not (apply *or (map empty-stream? s))) 
(cons (apply f (map head s)) 

(delay (loop (map tail s)))) 
the-empty-stream) ) ) 

(define (multiples-of n h) 
(cons (* n h) 

(delay (multiples-of (1+ n) h)))) 
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; ; ; ; General utilities 

(define false (= 1 0)) 
(define true (=0 0)) 

(define (*or . disjuncts) 

(cond ((null? disjuncts) false) 
((car disjuncts) true) 
(else (apply *or (cdr disjuncts))))) 

(define (*and . conjunct s) 

(cond ((null? conjuncts) true) 

((car conjuncts) (apply *and (cdr conjuncts))) 
(else false))) 

(define (assert x? . m) 
(if (not x?) 

(apply error m))) 

(define (lookup pred? 1) 
(cond ((null? 1) '()) 

((pred? (car 1)) (car 1)) 
(else (lookup pred? (cdr 1))))) 

(define (translate compare from-list to-list key) 
(let loop ((from from-list) (to to-list)) 

(cond ((null? from) (error "Can't find" key "in" from-list)) 
((compare key (car from)) (car to)) 
(else (loop (cdr from) (cdr to)))))) 
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; Multisets 

GROUP forms the set of groups of elements from SETS such that all 
of the elements in each group have the same value of the KEY. It is 
presumed that each set has a unique element with each key and each key 
appears in every set. 

(define (group key sets) 

(let ((n (- (length sets) 1))) 

(define (new-frob frob sets accum) 
(let ((bin (lookup (lambda (x) 

(eq? (key frob) (key (car x)))) 
accum) ) ) 
(if bin 

(if (= (length bin) n) 

(adjoin (cons frob bin) 

(scan-out sets (delete bin accum))) 
(scan-out sets 

(adjoin (cons frob bin) 

(delete bin accum)))) 
(scan-out sets (adjoin (list frob) accum))))) 
(define (scan-out sets accum) 
(if (empty? sets) 
(empty-set) 
(if (empty? (first sets)) 

(scan-out (rest sets) accum) 
(new-frob (first (first sets)) 

(adjoin (rest (first sets)) (rest sets)) 
accum) ) ) ) 
(scan-out (apply make-set sets) ; coerce list to multiset, 
(empty-set)))) 



MAP-DISTINCT-PAIRS applys a function, F, to every distinct pair 
of values chosen from the multiset, M, producing a multiset of the 
results . 

(define (map-distinct-pairs f mset) 
(map (lambda (p) (apply f p)) 
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(di«tinct-p*ir» m»t))) 



(dif ia« (distinct-ptirt wet) 
(if (wptyT aMt> 
(w pty i tt ) 
(li« «f (*iwt aMt)) 

(r (d it ia c t-flr» (s«vt Hit)))) 
(l«t loop Ulmtt (rmt ■*•«))) 
(If (it j? lft) 



(adjoin (lUt f (fir*t Wft» 

(loop (***t l«ft)))»») 



(d«f iM Mk«-nt list) 
(d*f la* t istt ecr) 
(dafiM net «i*) 
(d*f in* *dj«ia ew) 



(d«f ia* MptyT waiT) 
(d«fin« («q*y*««tt) '()) 
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; ; ; Initial conditions commonly used as a starting point 
; ; ; for solar system integrations 

(define SS 
(let () 

Heliocentric equatorial rectangular coordinates 1950.0 

From Schubart and Stumpff (1966) 

Note that velocities are entered in AU/40 days but converted to 

AU/day on system construction. Positions are in AU. Masses are in Msun. 

(define sun 

(make-rectangular-heliocentric 'sun (+ 1 (/ 1 6000000)) ; +Mercury 

(3-vector 0) 

(3-vector 0))) 

(define venus 

(make-rectanguiar-heliocentric 'venus (/ 1 408000) 
(3-vector -.5113942959 -.4780976854 -.1830874810) 
(3-vector .5663768182 -.5120871589 -.2664978745))) 

(define earth 

(make-rectangular-heliocentric 'earth (/ 1 329390) 
(3-vector -.2614989917 .8696237687 .3771652157) 
(3-vector -.6746573183 -.1700948008 -.07377431920))) 

(define mars 

(make-rectangular-heliocentric 'mars (/ 1 3093500) 
(3-vector -1.295477589 -.8414136141 -.3513513446) 
(3-vector .3440042605 -.3696674843 -.1789373952))) 

(define j up iter 

(make-rectangular-heliocentric 'jupiter (/ 1 1047.355) 
(3-vector 3.429472643 3.353869719 1.354948917) 
(3-vector -.2228647739 .2022768826 .09223051780))) 

(define saturn 

(make-rectangular-heliocentric 'saturn (/ 1 3501.6) 
(3-vector 6.641453441 5.971569844 2.182315015) 
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(3-vector -.1662288590 .1462712350 .06765636470))) 

(define uranus 

(make-rectangular-heliocentric 'uranus (/ 1 22869) 
(3-vector 11.26304125 14.69525888 6.279605833) 
(3-vector -.1301308165 .07588051779 .03508967792))) 

(define neptune 

(make-rectangular-heliocentric 'neptune (/ 1 19314) 
(3-vector -30.15522934 1.657000860 1.437858110) 
(3-vector -.009619598984 -.1150657040 -.04688875226))) 

(define pluto 

(make-rectangular-heliocentric 'pluto 
(3-vector -21.12383780 28.44651101 15.38826655) 
(3-vector -.07074485007 -.08655927220 -.005946850713))) 

(define (convert-to-AU/day system) ; only for rect helio 
(operate-system (lambda (particle) 

(make-rectangular-heliocentric 
(name particle) 
(mass particle) 
(position particle) 

((scale (/ 1 40)) (velocity particle)))) 
system)) 

(convert-to-AU/day 
(make-system 

(make-set sun venus earth mars jupiter saturn uranus neptune pluto))))) 
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